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Previous algorithms for constructing regression tree models for 
longitudinal and multiresponse data have mostly followed the CART 
approach. Consequently, they inherit the same selection biases and 
(J_) 1 computational difficulties as CART. We propose an alternative, based 

, on the GUIDE approach, that treats each longitudinal data series as 

a curve and uses chi-squared tests of the residual curve patterns to 
select a variable to split each node of the tree. Besides being unbiased, 
the method is applicable to data with fixed and random time points 
and with missing values in the response or predictor variables. Simu- 
lation results comparing its mean squared prediction error with that 
of MVPART are given, as well as examples comparing it with stan- 
dard linear mixed effects and generalized estimating equation models. 
Conditions for asymptotic consistency of regression tree function es- 
timates are also given. 

1. Introduction. A regression tree model is a nonparametric estimate 
of a regression function constructed by recursively partitioning a data set 
with the values of its predictor X variables. CART (Breiman et al., 1984) 
is one of the oldest algorithms. It yields a piecewise-constant estimate by 
recursively partitioning the data using binary splits of the form X < c if X 
is ordinal, and X G A if X is categorical. The impurity of a node t of the 
tree is defined as the sum of squared deviations i(t) = J2(y~yt) 2 , where yt is 
the sample mean of response variable Y in t and the sum is over the y values 
in t. The split of t into subnodes and tn that maximizes the reduction 
in node impurity i{t) — i(t£) — i(tn) is selected. Partitioning continues until 
either the X or the y values are constant in a node, or the node sample size 
is below a pre-specified threshold. Then the tree is pruned with the help of 
an independent test sample or by cross-validation and the subtree with the 
lowest estimated mean squared error is selected. 

Several attempts have been made to extend CART to longitudinal and 
multiresponse data, often by using likelihood-based functions as node impu- 
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rity measure. The earliest attempt for longitudinal data seems to be Segal 
(1992), which uses the likelihood of an autoregressive or compound symme- 
try model. If values are missing from the Y variable, parameter estimation 
is performed by the EM algorithm. Computational difficulties in estimat- 
ing the covariance matrices limit the method to data observed at equally- 
spaced time points. Abdolell et al. (2002) follow the same approach, but use 
a likelihood-ratio test statistic as impurity function. 

Zhang (1998) extends the CART approach to multiple binary response 
variables, assuming there are no missing values in the Y variable. It uses 
as impurity function the log-likelihood of an exponential family distribution 
that depends only on the linear terms and the sum of second-order products 
of the responses. Zhang and Ye (2008) extend this idea to ordinal responses 
by first transforming them to binary-valued indicator functions. Again, the 
approach is hindered by the computational difficulties of having to compute 
covariance matrices at every node. 

De'ath (2002) avoids the covariance computations by following the CART 
algorithm exactly except for two simple modifications: the sample mean is re- 
placed by the (i-dimensional sample mean and the node impurity is replaced 
by i(t) = J2k=\ik(t), where is the sum of squared deviations about 
the mean of the kth response variable in t. The algorithm is implemented 
in the R package MVPART (De'ath, 2012). Larsen and Speckman (2004) 
adopt the same approach, but use Mahalanobis distance as node impurity, 
with covariance matrix estimated from the whole data set. 

In a different direction, Yu and Lambert (1999) treat each longitudinal 
data vector as a random function or trajectory. Instead of fitting a longitu- 
dinal model to each node, they first reduce the dimensionality of the whole 
data set by fitting each data trajectory with a low-order spline curve. Then 
they use the estimated coefficients of the basis functions as multivariate 
responses to fit a regression tree model, with the mean coefficient vectors 
as predicted values and standardized squared error as node impurity. They 
recover the predicted trajectory in each node by reconstituting the spline 
function from the mean coefficient vector. They mention as an alternative 
the use of principal component analysis to reduce the data dimension and 
then fitting a multivariate regression tree model to the largest principal 
components. 

A major weakness of CART is that its selection of variables for splits is 
biased toward certain types of variables. Because a categorical X with m 
unique values allows (2 m_1 — 1) splits of the data and an ordinal X with n 
unique values allows (n — 1) splits, categorical variables with many unique 
values tend to have an advantage over ordinal variables in being selected 
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(Loh and Shih, 1997; Shih, 2004; Strobl, Boulesteix and Augustin, 2007). 
This weakness is inherited by all multivariate extensions of CART, includ- 
ing MVPART (Hsiao and Shih, 2007). Further, because reductions in node 
impurity from splitting are based on observations without missing values, 
variables with fewer missing values are more likely to yield larger reductions 
(and hence be selected for splitting) than those with more missing values; 
see Section 5 below. 

GUIDE (Loh, 2002) avoids selection bias by replacing CART's one-step 
method of simultaneously selecting the split variable X and split set with 
a two-step method that first selects X and then finds the split set for the 
selected X. This approach makes it practicable for GUIDE to fit a non- 
constant regression model in each node. 

The goal of this article is to extend GUIDE to multivariate and longitudi- 
nal response variables. Section 2 briefly reviews the GUIDE variable selection 
method for univariate response variables. Section 3 extends it to multivariate 
responses and longitudinal data observed at fixed time points. The procedure 
is illustrated with an application to a data set on the strength and viscosity 
of concrete. Section 4 compares the selection bias and prediction accuracy of 
our method with MVPART in a simulation study. Section 5 deals with the 
problem of missing values, which can occur in the predictor as well as the 
response variables. We propose a solution and apply it to some data on the 
mental health of children that are analyzed in Fitzmaurice, Laird and Ware 
(2004) with a generalized estimating equation (GEE) approach. Section 6 
further extends our method to longitudinal data with random time points. 
We illustrate it with an example on the hourly wages of high-school dropouts 
analyzed in Singer and Willett (2003) with linear mixed effect (LME) mod- 
els. Section 7 compares the prediction accuracy of our method with that of 
GEE and LME models in a simulation setting. Section 8 applies the ideas to 
simultaneously modeling two longitudinal series from a study on maternal 
stress and children illness analyzed in Diggle et al. (2002) with GEE logis- 
tic regression. Section 9 gives conditions for asymptotic consistency of the 
multivariate regression tree function estimates and Section 10 concludes the 
article with some remarks. 

2. Univariate GUIDE algorithm. The GUIDE algorithm for a uni- 
variate response variable Y can fit a linear model in each node using one of 
several loss functions. For our purposes here, it suffices to review the algo- 
rithm for least-squares piecewise-constant models. The key idea is to split a 
node with the X variable that shows the highest degree of clustering in the 
signed residuals from a constant model fitted to the data in the node. If a 
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Fig 1. Plots ofY versus Xi and X 2 from data generated from the model Y — X\ +e. The 
horizontal line marks the sample mean ofY. 



Table 1 

Contingency tables of X\ and X2 versus signs of residuals. Chi-squared p-values are 



0.005 


and 0.404, respectively. 
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predictor variable X has no effect on the true regression mean function, a 
plot of the residuals versus X should not exhibit systematic patterns. But if 
the mean is a function of X, clustering of the signed residuals is expected. 

To illustrate, consider some data generated from the model Y = X\ + e, 
with X\ and X2 independent C7(— 1.5, 1.5), i.e., uniformly distributed on the 
interval (—1.5, 1.5), and e independent standard normal. Since the true re- 
gression function does not depend on X2 , a piecewise-constant model should 
split on X\ only. This is easily concluded from looking at plots of Y versus 
X\ and X2, as shown in Figure 1. In the plot of Y versus X\, the positive 
residuals are clustered at both ends of the range of X\ and the negative 
residuals near the center. No such clustering is obvious in the plot of Y 
versus X2. 

GUIDE measures the degree of clustering by means of contingency ta- 
ble chi-squared tests. In each test, the values of X are grouped into a small 
number of intervals (indicated by the vertical dashed lines in Figure 1), with 
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the groups forming the rows and the residual signs forming the columns of 
the table. The end points are computed such that each interval has approx- 
imately the same number of observations if X is uniformly distributed (see 
Algorithm 3.1 below for the definitions). Table 1 shows the table counts 
and the chi-squared p- values for the data in Figure 1. If X is a categorical 
variable, its values are used to form the rows of the table. 

GUIDE selects the variable with the smallest chi-squared p-value to split 
the node. Because the sample size in a node decreases with splitting, the 
p- values are approximate at best. Their exact values are not important, 
however, as they serve only to rank the variables for split selection. Similar 
p- value methods have been used in classification tree algorithms; e.g., F-tests 
(Loh and Shih, 1997) and permutation tests (Hothorn, Hornik and Zeileis, 
2006). One benefit from using p- values is lack of selection bias, at least for 
sufficiently large sample sizes. This is due to the p-values being approxi- 
mately identically distributed if all the X variables are independent of Y. 

After a variable is selected, the split set is found by exhaustive search 
to maximize the reduction in sum of squared residuals. A side (but practi- 
cally important) benefit is significant computational savings over the CART 
method of searching for the best split set for every X. The procedure is ap- 
plied recursively to construct an overly large tree. Then the tree is pruned 
using cross-validation as in the CART algorithm and the subtree with the 
smallest cross-validation estimate of mean squared error is selected. 

3. Multiple response variables. We motivate our extension of GUIDE 
to multiple response variables with an analysis of some data on the strength 
and viscosity of concrete (Yeh, 2007) taken from the UCI Machine Learning 
Repository (Asuncion and Newman, 2007). There are 103 complete obser- 
vations on seven predictor variables (cement, slag, fly ash, water, superplas- 
ticizer (SP), coarse aggregate, and fine aggregate, each measured in kg per 
cubic meter) and three response variables (slump and flow, in cm, and 28- 
day compressive strength in Mpa). Slag and fly ash are cement substitutes. 
Slump and flow measure the viscosity of concrete; slump is the vertical height 
by which a cone of wet concrete sags and flow is the horizontal distance by 
which it spreads. The objective is to understand how the predictor variables 
affect the values of the three response variables jointly. 

Fitting a separate multiple linear regression model to each response is 
not enlightening, as the results in Table 2 show. Cement, fly ash, water, 
and coarse aggregate are all significant (at the 0.05 level) for strength. The 
signs of their coefficients suggest that strength is increased by increasing 
the amounts of cement and fly ash and decreasing that of water and coarse 
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Table 2 

Separate linear regression models, with p-values less than 0. 05 in italics 



Slump 
Estimate P-value 


Flow 

Estimate P-value 


Strength 
Estimate P-value 


(Intercept) 


-88.525 


0.66 


-252.875 


0.47 


139.782 


0.052 


Cement 


0.010 


0.88 


0.054 


0.63 


0.061 


0.008 


Slag 


-0.013 


0.89 


-0.006 


0.97 


-0.030 


0.352 


Fly ash 


0.006 


0.93 


0.061 


0.59 


0.051 


0.032 


Water 


0.259 


0.21 


0.732 


0.04 


-0.233 


0.002 


SP 


-0.184 


0.63 


0.298 


0.65 


0.10 


0.445 


CoarseAggr 


0.030 


0.71 


0.074 


0.59 


-0.056 


0.045 


FineAggr 


0.039 


0.64 


0.094 


0.51 


-0.039 


0.178 



aggregate. Since no variable is significant for slump, one may be further 
tempted to conclude that none is important for its prediction. This is false, 
because a linear regression for slump with only water and slag as predic- 
tors finds both to be highly significant. The problem is due to the design 
matrix being quite far from orthogonal (see Figure 2). Therefore it is risky 
to interpret each regression coefficient by "holding the other variables con- 
stant." Besides, the main effect models are most likely inadequate anyway. 
Inclusion of interaction terms, however, brings on other difficulties, such as 
knowing which terms and of what order to add, which makes interpretation 
even more challenging. 

Instead of controlling for the effects of other variables by means of an 
equation, a regression tree model achieves a similar goal by dividing the 
sample space into partitions defined by the values of the variables, thus 
effectively restricting the ranges of their values. Figure 3 shows three GUIDE 
tree models, one for each response variable, with predicted values beneath 
the terminal nodes. We see that less slag and more water yield larger values 
of slump, more water yields larger values of flow, and higher amounts of 
cement and fly ash produce the strongest concrete. Although it is easier to 
interpret the tree structures than the coefficients of the linear models, it is 
still non-trivial to figure out from the three trees how the variables affect 
the response variables jointly. For example, the trees show that (i) slump is 
least when slag > 137, (ii) flow is least when water < 182 and slag > 66, 
and (iii) strength is greatest when cement > 317 and fly ash > 115. We may 
thus conclude that the intersection of these conditions yields the strongest 
and least viscous concrete. But there are no observations in the intersection. 

A single tree model that simultaneously predicts all three responses would 
not have these difficulties. Ideally, such an algorithm would produce compact 
trees with high predictive accuracy and without variable selection bias. The 
main hurdle in extending GUIDE to multiple response variables is unbiased 
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Fig 2. Plots of pairs of predictor variables for concrete data 



SLUMP FLOW STRENGTH 




41 47 



Fig 3. Univariate GUIDE models for predicting slump (left), flow (middle), and strength 
(right) of concrete. At each node, a case goes to the left subnode if and only if the stated 
condition is satisfied. The predicted value is in italics below each terminal node. 
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variable selection. Once this problem is solved, the rest of the method follows 
with a simple modification to the node impurity function. 

Lee (2005) proposes one extension, applicable to ordinal X variables only, 
that fits a GEE model to the data in each node. It classifies each observa- 
tion into one of two groups according to the sign of its average residual, 
Ylk=l wnere t-ik is the residual of the ith observation for the feth 
response variable. Then a two-sample i-test is performed for each X and the 
one with the smallest p-value is selected to split the node. The split point is 
a weighted average of the X values in the two groups. If the smallest p- value 
exceeds a pre-specified threshold, splitting stops. 

Lee's solution is deficient in several respects. First, the p-value threshold 
is hard to specify, because it depends on characteristics of the data set, 
such as the number and type of variables and the sample size. Second, it 
is inapplicable to categorical predictor variables. Third, it is inapplicable 
to data with missing predictor or response values. Finally, for the ultimate 
goal of clustering the response vectors into groups with similar patterns, 
classifying them into two groups by the signs of their average residuals is 
potentially ineffective, because two response vectors can have very dissimilar 
patterns and yet have average residuals with the same sign. 

A more effective extension can be obtained by working with the residual 
sign vectors instead. Let (Y\,Y2, . . . , Yd) be the d response variables. At each 
node, we fit the data with the sample mean vector and compute the residual 
vectors. Since each residual can have a positive or non-positive sign, there are 
2 d possible patterns for the residual sign vector. To determine if a predictor 
variable X is independent of the residual pattern, we form a contingency 
table with the sign patterns as the columns and the (grouped, if X is not 
categorical) values of X as the rows and find the p- value of the chi-squared 
test of independence. Specific details are given in the algorithm below. Other 
aspects of the method are the same as in univariate GUIDE, except for the 
node impurity function being sum of (normalized, if desired) squared errors. 

Algorithm 3.1. Split variable selection at each node t 

1. Find y = (yi, . . . , yd), where yk is the mean of the non-missing values 
of the kth response variable in t. 

2. Define the sign vector Z = (Zi, Z2, ■ ■ ■ , Zj) such that Z^ = \ ifY^> y^ 
and Zk = — 1 if Yk < y~k- IfY^ is missing, the user can choose either 
Zk = 1 or Zk = — 1 (default), with same choice used for all nodes. 

3. Main effect tests. Do this for each X variable: 

(a) If X is not categorical, group its values into m intervals. Let 
x and s denote the sample mean and standard deviation of the 
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non-missing values of X in t. If the number of data points is less 
than 5 x 2 d+2 , set m = 3 and define the interval end points to 
bex± sV3/3. Oth erwise, set m = 4 and define the interval end 
points as {x, x ± s^/3/2}. 

(b) If X is categorical, use its categories to form the groups. 

(c) Create an additional group for missing values if X has any. 

(d) Form a contingency table with the 2 d patterns of Z as columns 
and the X-groups as rows, and compute the p-value of the chi- 
squared test of independence. 

4- If the smallest p-value is less than 0.05/d, select the associated X vari- 
able and exit. 

5. Otherwise, do these interaction tests for each pair of variables Xi,Xy. 

(a) If Xi is non- categorical, split its range into two intervals An and 
Ai2 at its sample mean. If Xi is categorical, let denote the 
singleton set containing its kth value. Do the same for Xj. 

(b) Define the sets B^^ m = {(xi,Xj) : Xi £ Ai^,Xj E Aj m }, for k,m = 
1,2,.... 

(c) Form a contingency table with the Z patterns as columns and 
{Bk,m} as rows and compute its p-value. 

6. If the smallest p-value from the interaction tests is less than 0.05/ {d{d— 
1)}, select the associated pair of predictors. 

7. Otherwise, select the X with the smallest main effect p-value from 
step 4- 

The value of m in step 3a is chosen to keep the row-dimension of the 
table as small as possible without sacrificing its ability to detect patterns. 
The interval end points are chosen so that if X has a uniform distribution, 
each interval has roughly the same number of observations. If d = 1, these 
definitions reduce to those in the univariate GUIDE algorithm (Loh, 2009, 
p. 1716). 

If a non-categorical variable is selected in step 4, the split X < c is 
found by searching over all midpoints c of consecutive order statistics to 
minimize the total sum of squared deviations of the the two subnodes. If 
X is a categorical variable, the search for a split of the form X £ A can 
be computationally daunting if X takes many values. To obtain a quick but 
approximate solution, we create a classification variable from the Z patterns 
in each node and then use the method described in Loh (2009, Appendix) 
for classification trees to find the set A. We also use the procedures in Loh 
(2009) to find the split set if a pair of variables is selected in step 6. 
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Table 3 

Contingency table formed by cross-tabulating residual signs vs. water groups 
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Water < 185.5 
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16 
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185.5 < Water < 208.8 


6 


2 


1 





4 


1 


14 


13 


Water > 208.8 


3 


2 


1 


1 








13 


8 



For the concrete data, the values of each X variable are grouped into 
three intervals. Table 3 shows the contingency table formed by the residual 
signs and the groups for water, which has the smallest chi-squared p- value of 
8 x 1CP 5 . The top half of Figure 4 shows the tree model after pruning by ten- 
fold cross-validation. We will use the description "multivariate GUIDE" to 
refer to this method from now on. Predicted values of the response variables 
are shown by the heights of the bars in the figure. The strongest and most 
viscous concrete is obtained with water < 182 kg/m 3 and coarse aggregate 

< 960 kg/m 3 . This is consistent with these two variables having negative 
coefficients for strength in Table 2. The tree model also shows that the 
combination of water > 182 kg/m 3 , cement > 180 kg/m 3 , and fly ash > 117 
kg/m 3 yields concrete that is almost as strong but least viscous. Thus it is 
possible to make strong concrete with low or high viscosity. The combination 
predicting concrete with the least strength is water > 182 kg/m 3 and cement 

< 180 kg/m 3 . The MVPART (De'ath, 2012) model is shown in the bottom 
half of Figure 4. Its first split is the same as that of GUIDE, but the next 
two splits are on slag. 

To compare the prediction accuracy of the methods, we first normalize the 
values of the three response variables to have zero mean and unit variance 
and then apply leave-one-out cross-validation to estimate their sum of mean 
squared prediction errors of the pruned trees, where the sum is over the three 
response variables. The results are quite close, being 1.957, 2.097, and 2.096 
for univariate GUIDE, multivariate GUIDE, and MVPART, respectively. As 
we will see in the next section, univariate trees tend to have lower prediction 
error than multivariate trees if the response variables are uncorrelated and 
higher prediction error when the latter are correlated. In this example, slump 
and flow are highly correlated (cor = 0.91) but each is weakly correlated with 
strength (-0.22 and -0.12, respectively). Thus there is a cancellation effect. 

4. Selection bias and prediction accuracy. We carried out some 
simulation experiments to further compare the variable selection bias and 
prediction accuracy of GUIDE and MVPART. To show the selection bias of 
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Fig 4. Multivariate GUIDE (top) and MVPART (bottom) models for the concrete data. 
Sample sizes are beneath and predicted values (slump, flow, and strength, resp.) are on the 
left of each node. Barplots show the predicted values in the terminal nodes of the trees. 
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Fig 5. Estimated probabilities (based on 5000 simulation trials) that each predictor vari- 
able is selected to split the root node when all are independent of the response variables. 
Standard errors are less than 0. 005. Variable Ck is multinomial with equal probabilities on 
k categories. The horizontal line marks the level for unbiased selection. 



MVPART, we took the concrete data as a population distribution and drew 
bootstrap samples from it of the same size (n = 103). Then we randomly 
permuted the values in each predictor variable to render it independent 
of the response variables. An unbiased algorithm now should select each 
variable with the same probability 1/7 = 0.143 to split the root node. The 
left panel of Figure 5 shows the estimated selection probabilities for GUIDE 
and MVPART from 5000 simulation trials. The estimates for GUIDE are 
all within two simulation standard errors of 1/7 but those of MVPART are 
not: they are roughly proportional to the number of unique values of the 
variables in the data, namely, 80, 63, 58, 70, 32, 92, and 90 for cement, slag, 
fly ash, water, SP, coarse aggregate, and fine aggregate, respectively. 

To demonstrate the bias of MVPART toward selecting variables with more 
split sets, we added two independent predictor variables, C2 and C20, where 
Ck denotes a multinomial variable with equal probabilities on k categories. 
Variable C2 allows only one split but variable C 20 has 2 19 - 1 = 524,287 
splits. An unbiased method now should select each variable with probability 
1/9 = 0.111. The results, based on 5000 simulation trials, are shown in the 
right panel Figure 5. GUIDE is again essentially unbiased (within simulation 
error) but MVPART selects C20 more than 86% of the time and C2 only 10 
out of 5000 times. 

To compare the prediction accuracies of MVPART and univariate and 
multivariate GUIDE, we use three simulation scenarios, with each having 
seven predictor variables and three response variables. The values of the 
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Table 4 

Estimated mean squared error (MSE) and number of terminal nodes (Nodes) using 100 

training samples in 1000 simulation trials. Standard errors of MSE in parentheses. 
'Univariate GUIDE' refers to the model with a separate tree for each response variable. 





Univariate GUIDE 


Multivariate GUIDE 


MVPART 




Scenario 


MSE x 10 2 


Nodes 


MSE x 10 2 Nodes 


MSE x 10 2 Nodes 






Xi, ■ ■ ■ , X-j are independent U (- 


-0.5,0.5) 




(4.1) 


14.1 (0.1) 


5.7 


21.9 (0.1) 3.4 


22.2 (0.1) 


3.1 


(4.2) 


35.1 (0.2) 


8.3 


24.2 (0.2) 4.5 


22.3 (0.2) 


4.5 


(4.3) 


33.0 (0.5) 


11.8 


12.3 (0.4) 4.2 


68.8 (0.7) 


2.7 




Xi,.. 


,X 6 are N(0,V), X 7 is independent U(-0.5, 0.5) 




(4.1) 


47.2 (0.3) 


13.7 


156.0 (0.7) 6.4 


128.4 (0.6) 


13.0 


(4.2) 


198.0 (1.2) 


22.5 


206.1 (1.4) 6.3 


158.2 (1.2) 


10.9 


(4.3) 


48.0 (0.6) 


11.3 


15.5 (0.5) 4.4 


68.2 (0.8) 


3.6 



response variables are generated by the equation = Hk + e > k = 1, 2, 3, 
where the e are independent normal variables with mean and variance 
0.25. The three scenarios are: 

(4.1) (W,M2,M 3 ) = (X 1 ,X 2 ,X 3 ) 

(4.2) (W,M2,M 3 ) = (X 1 + X 2 , X 1 + X 2 , X l +X 2 ) 



(4.3) (m,H2,lJ>3) 



(1,-1,0), X 1 X 2 >0 
(0,0,1), X X X 2 <Q 



Scenarios (4.1) and (4.2) are standard linear regression models. Univariate 
GUIDE should be most accurate in scenario (4.1), because each mean re- 
sponse depends on a different predictor variable. The same may not be true 
for scenario (4.2), where a multivariate regression tree may be able to uti- 
lize the joint information among the responses variables. Scenario (4.3) has 
a piecewise-constant tree structure, but it can be challenging due to the 
absence of main effects. 

Two simulation experiments were performed. In the first experiment, 
variables Xi,...,Xj are mutually independent U(— 0.5, 0.5). For each sce- 
nario, 100 training samples are generated in each simulation trial and a 
pruned regression tree model (using the CART pruning method) constructed 
by each method. One hundred independent test values (Xij,X 2 j, ■ ■ ■ ,X?j), 
j = 1, . . . , 100, are generated to evaluate the models. Let (nij , fi 2 j , fi^j ) de- 
note the mean response values for the jth test sample, (p,\j , fi 2 j , ft^j) denote 
their predicted values, and MSE = J2j=i X^=i(A«j — ^r/') 2 /100 denote the 
estimated mean squared error. 

The upper half of Table 4 shows the average values of MSE and their 
standard errors over 1000 simulation trials. The average numbers of termi- 
nal nodes are also shown (for the univariate GUIDE method, this is the 
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sum of the number of terminal nodes of the separate trees). As expected, 
univariate GUIDE is more accurate than the multivariate tree methods in 
scenario (4.1), where the means are unrelated. On the other hand, mul- 
tivariate GUIDE is more accurate in scenarios (4.2) and (4.3) because it 
can take advantage of the relationships among the response variables. The 
accuracy of MVPART is close to that of multivariate GUIDE, except in 
scenario (4.3), where it has difficulty detecting the interaction effect. The 
higher accuracy of multivariate GUIDE here is due to the interaction tests 
in step 5 of Algorithm 3.1. 

In the second experiment, we generated (Xi, ■ ■ ■ ,Xq) as multivariate nor- 
mal vectors with zero mean and covariance matrix 

/l r r 0\ 

1 r r 

r 1 r 

~~ r r 1 

r 1 r 

\0 r r 1/ 

and r = 0.5. Thus (X\, X3, X4) is independent of (X2, X$, Xq). As in the 
previous experiment, Xj is independent U{— 0.5, 0.5). The results, given in 
the bottom half of the table, are quite similar to those in the first experiment, 
except in scenario (4.1), where MVPART has lower MSE than multivariate 
GUIDE and in scenario (4.2), where univariate GUIDE has lower MSE than 
multivariate GUIDE. Notably, the average number of terminal nodes in the 
MVPART trees is about twice the average for multivariate GUIDE in these 
two scenarios. The larger number of nodes suggest that the trees may be 
splitting on the wrong variables more often. But because these variables are 
correlated with the correct ones and because of the effectiveness of pruning, 
the MSEs are not greatly increased. 

5. Missing values. Missing values in the predictor variables do not 
present new challenges, as the method in univariate GUIDE can be used as 
follows (Loh, 2009). If X has missing values, we create a "missing" group 
for it and carry out the chi-squared test with this additional group. Besides 
allowing all the data to be used, this technique can detect relationships 
between the missing patterns of X and the values of the response variables. 

The search for a split set for a categorical X with missing values is no 
different from that for a categorical variable without missing values, be- 
cause missing values are treated as an additional category. But if X is non- 
categorical and has missing values, we need to find the split point and a 
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Fig 6. Estimated probabilities (from 5000 simulation trials) of variable selection when all 
variables are independent of the response variables. Standard errors are less than 0.005. 
The horizontal line marks the probability for unbiased selection. 



method to send cases with missing values through the split. For the first 
task, all splits at mid-points between consecutive order statistics of X are 
considered. All missing X values are temporarily imputed with the mean 
of the non-missing values in the node. Because the sample mean usually 
belongs to the node with the greater number of observations, this typically 
sends the missing values to the larger node. The best among these splits 
is then compared with the special one that sends all missing values to one 
node and all non-missing values to the other, and the one yielding the greater 
impurity reduction is selected. 

Our approach to split selection is different from that of MVPART, which 
uses the CART method of searching for the split that maximizes the re- 
duction in total sum of squared errors among the observations non-missing 
in the split variable. As a consequence, MVPART has a selection bias to- 
ward variables with fewer missing values. This can be demonstrated using 
the procedure in section 4, where we take bootstrap samples of the con- 
crete data and randomly permute its predictor values. Figure 6 shows the 
selection probabilities before and after 80% of the values in FineAggr are 
made randomly missing, based on 5000 simulation trials. Variations in the 
GUIDE probabilities are all within three simulation standard errors of 1/7, 
but those of MVPART are not. More importantly, there is a sharp drop in 
the selection probability of FineAggr due to missing values. 

Missing values in a univariate response variable do not cause problems, be- 
cause those observations are routinely omitted. But if the response is multi- 



16 



W-Y. LOH AND W. ZHENG 



Table 5 

Estimated GEE model for children mental health data, from 
Fitzmaurice, Laird and Ware (2004, P- 438) 



Variable 


Estimate 


SE 


Z 


Intercept 


-1.685 


0.100 


-16.85 


Parent informant (Xi) 


-0.467 


0.118 


-3.96 


Single parent status (X2) 


0.611 


0.108 


5.68 


Fair or poor child health (X3) 


0.146 


0.135 


1.08 


Informant x child health (X1X3) 


0.452 


0.157 


2.87 



dimensional, it is wasteful to omit an observation simply because one or 
more responses are missing, as MVPART and the methods of Abdolell et al. 
(2002) and Zhang (1998) require. Segal (1992) allows missing responses in 
longitudinal data, but only if the variable is continuous, is observed at 
equally-spaced time points, and the data in each node are fitted with an 
autoregressive or compound symmetry model. In our approach, if there are 
missing values in some but not all response variables, step 2 of Algorithm 3.1 
takes care of them by giving the user the choice of = — 1 or = 1 for 
missing Y^. For split set selection, we compute the mean response for each 
Yk from the non-missing values in the node and the sum of squared errors 
from the non-missing Yk values only. 

To illustrate these ideas, consider a data set from a survey of the men- 
tal health of 2501 children, analyzed in Fitzmaurice, Laird and Ware (2004, 
Sec. 16.5). One purpose of the survey was to understand the influence of 
parent status (single vs. not single) and child's physical health (good vs. 
fair or poor) on the prevalence of externalizing behavior in the child. Each 
child was assessed separately by two "informants" (a parent and a teacher) 
on the presence or absence (coded 1 and 0, respectively) of delinquent or 
aggressive externalizing behavior. All the parent responses were complete, 
but 1073 children (43%) did not have teacher responses. 

For child i, let Yjj = 1 if the jth informant (where j = 1 refers to parent 
and j = 2 to teacher) reports externalizing behavior, and Yy = otherwise. 
Assuming that the Kj are missing at random and the covariance between 
the two responses is constant, Fitzmaurice et al. use a generalized estimating 
equation (GEE) method to simultaneously fit this logistic regression model 
to the two responses: 

log{P(Yij = l)/P(Yij = 0)} = fa + PlXuj + P2X 2 ij + P3X3ij + Pl3XlijX 3 ij. 

Here xuj = 1 if j = 1 and otherwise, X2ij = 1 if the parent is single and 
otherwise, and x^ij = 1 if the child's health is fair or poor and otherwise. 
Table 5 shows the estimated coefficients from Fitzmaurice et al. (p. 438). It 
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Fig 7. Multivariate GUIDE tree model for children mental health data. A case goes to 
the left branch at each intermediate node if and only if the condition on its left is sat- 
isfied. Sample sizes are given beneath the terminal nodes. The barplots below them give 
the proportions of parents (P) and teachers (T) reporting externalizing behavior and the 
proportions missing teacher responses. 



suggests that a report of externalizing behavior is more likely if the informant 
is a teacher or the parent is single. The significant interaction implies that 
the probability is further increased if the informant is a parent and the child 
has fair or poor health. 

The multivariate GUIDE model, using = — 1 for missing values in 
step 2 of Algorithm 3.1, is shown in Figure 7. It splits first on child health 
and then on single parent status. (The model using Zf~ = 1 for missing 
splits first on single parent status and then on child health, but its set of 
terminal nodes is the same.) The barplots below the terminal nodes com- 
pare the predicted proportions (means of Yy) of the parents and teachers 
who report externalizing behavior and the proportions of missing teacher 
responses. The interaction effect in the GEE model can be explained by 
the barplots: parent reports of externalizing behavior are less frequent than 
teacher reports except when the child's health is not good and the parent is 
not single. The main effect of single parent status is also clear: both parent 
and teacher reports are more frequent if the parent is single. Further, chil- 
dren of single parents are more likely to be missing teacher reports. Figure 8 
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Fig 8. MVPART tree model for children mental health data. A case goes to the left branch 
at each intermediate node if and only if the condition on its left is satisfied. Sample sizes 
are given beneath the terminal nodes. The barplots give the proportions of parents (P) and 
teachers (T) reporting externalizing behavior and the proportions missing teacher responses 
in the terminal nodes. The model uses only the cases with non-missing response values. 




shows the MVPART tree, which splits only once, on single parent status. 
One reason for its brevity is that it ignores data from the 1073 children that 
do not have teacher responses. 

6. Longitudinal data. Algorithm 3.1 is directly applicable to longitu- 
dinal data as long as they are observed on a fixed grid and the number of 
grid points is small. Since these conditions may be too restrictive, we show 
here how to modify the algorithm for broader applicability. To motivate and 
explain the changes, consider a longitudinal study on the hourly wage of 
888 male high-school dropouts (246 black, 204 Hispanic, 438 white), whose 
observation time points as well as their number (1-13) varied across indi- 
viduals. Singer and Willett (2003, Sec. 5.2.1) fit a linear mixed effect (LME) 
model to the natural logarithm of hourly wage (wage) to these data. They 
choose the transformation partly to overcome the range restriction on hourly 
wage and partly to satisfy the linearity assumption. Their model is 

(6.1) E log(wage) = (3 + /3ihgc + /3 2 exper + /3 3 black + /3 4 hisp 

+ /^exper x black + ^exper x hisp 
+ &o + friexper 

where hgc is the highest grade completed, exper is the number of years (to 
the nearest day, after labor force entry), black = 1 if a subject is black 
and otherwise, hisp = 1 if a subject is Hispanic and otherwise, and 
bo and b\ are subject random effects. The fixed-effect estimates in Table 6 
show that hgc and exper are statistically significant, as is the interaction 
between exper and black. The main and interaction effects of hisp are not 
significant. 
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Table 6 

Fixed-effect estimates for linear mixed effect model (6.1) fitted to high-school dropout data 





Value 


Std. Error 


DF 


value 


p- value 


(Intercept) 


1.382 


0.059 


5511 


23.43 


0.000 


hgc 


0.038 


0.006 


884 


5.94 


0.000 


exper 


0.047 


0.003 


5511 


14.57 


0.000 


black 


0.006 


0.025 


884 


0.25 


0.804 


hisp 


-0.028 


0.027 


884 


-1.03 


0.302 


exper x black 


-0.015 


0.006 


5511 


-2.65 


0.008 


exper x hisp 


0.009 


0.006 


5511 


1.51 


0.131 




02468 12 02468 12 02468 12 02468 12 

exper exper exper exper 



Fig 9. Trajectories of eight high-school individuals. The solid curve is the lowess fit to all 
the subjects. The signs in the plot titles are the signed values of (Z\, ^2, Z3), where Zk = 1 
if the number of observations above the lowess curve is greater than the number below the 
curve in the kth time interval, and = — 1 otherwise. 
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Let Yij denote the response of the ith subject at the jth observation 
time itij. To render Algorithm 3.1 applicable to varying numbers and values 
of Uij, we first divide the range of the Uij values into d disjoint intervals, 
Ui, U2, • • • , Ud, of equal length, where d is user selectable. Then we replace 
steps 1 and 2 of the algorithm with these two steps: 

1. At each node, apply the lowess (Cleveland, 1979) method to the data 
points (uij,Yij) to estimate the mean of the Yij values with a smooth 
curve S(u). 

2. Define = 1 for subject i if the number of observations with Y^ > 
S(uij) is greater than or equal to the number with Y^ < S(uij), for 
Uij £ Uk, k = 1,2, ... ,d. Otherwise, define = —1. (By this defini- 
tion, Zk = —1 if there are no observations in Uk)- 

With these changes, we can fit a regression tree model to the wage data. 
Since our method is not limited by range restrictions on Yij nor linearity 
assumptions, we fit the model to untransformed hourly wage, using hgc 
and race as split variables, exper as the time variable, and d = 3. Figure 9 
shows the lowess curve for the data at the root node and a sample trajectory 
for each of the eight possible values of (Z\, Z 2 , Z3). Figure 10 gives the 
pruned tree, which has five terminal nodes. The first split is on race; if 
race = white, the node is further split on hgc < 9. Lowess curves for 
the five terminal nodes are drawn below the tree. Contrary to the finding 
in Singer and Willett (2003, p. 149) that the trajectories of Hispanic and 
White subjects cannot be distinguished statistically, we see that Hispanics 
tend to have slightly lower hourly wage rates than Whites. In addition, the 
slope of the mean trajectory for Blacks with hgc < 9 appears to decrease 
after 4 years of experience, contradicting the exponential trend implied by 
the logarithmic transformation of wage in the linear mixed model. 

7. GEE and LME versus GUIDE. A simulation experiment was 
performed to compare the prediction accuracies of GEE, GUIDE, and LME. 
Two simulation models are used, each with five independent predictor vari- 
ables, Xx, X2, ■ ■ ■ , -X5, uniformly distributed on (—1,1). Longitudinal obser- 
vations are drawn at d equally spaced time points, u = 1, 2, ...,d, with 
d = 10. The models are 



(7.1) 



Y u = 1 + Xx + X 2 + 2XxX 2 + 0.5u + b + b x u + e 



u 



and 



(7.2) 



Y u = 2.5I(Xi < 0) + 0.5-u + b + b x u + e 



a 
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Fig 10. Multivariate GUIDE tree for high-school dropout data on left; lowess-smoothed 
estimates of mean hourly wage by leaf node on right. At an intermediate node, a case goes 
to the left branch if and only if the given condition is satisfied; sample sizes are given 
beneath the terminal nodes. 
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Table 7 

Estimated mean squared errors for LME, GEE and GUIDE with standard errors 





LME 


GEE 


GUIDE 


Model (7.1) 
Model (7.2) 


1.00 ±0.01 
0.49 ±0.01 


1.12 ±0.01 
0.60 ±0.01 


1.27 ±0.03 
0.12 ±0.01 



where bo ~ iV(0, 0.5 2 ) and b\ ~ N(0, 0.25 2 ) are random effects, e u is standard 
normal, and all are mutually independent. The fitted model in both cases is 

Y u = fa + p x X x + fcX 2 + ... + /3 5 X 5 + /3 6 u + b + hu + e u 

and the parameters, /?o, • • • , are estimated using the R packages lme4 
(Bates, 2011) and geepack (Yan, Hojsgaard and Halekoh, 2012) for LME and 
GEE, respectively, with GEE employing a compound symmetry correlation 
structure. Model (7.1) is almost perfect for LME and GEE except for the 
interaction term and model (7.2) is almost perfect for GUIDE except for the 
terms linear in u. 

For each simulation trial, a training set of two hundred longitudinal 
data series are generated from the appropriate simulation model. Estimates 
f(u, xi,X2, • • • , X5) of the conditional mean E(y u \x\,X2, ■ ■ ■ , £5) are obtained 
for each method on a uniform grid of m = 6 5 = 7776 points (xn , Xi2, ■ ■ ■ , x^) G 
(— 1, l) 5 and the mean squared error 

m d 2 

MSE = (dm)' 1 ^2 {/(^) ^1)^*2) • • • 1^*5) - E(y u \xa,x i2 , . . . ,x i5 )\ 

1=1 u=l 

recorded. Table 7 shows the average values of the MSE and their esti- 
mated standard errors from 200 simulation trials. There is no uniformly 
best method. LME is best in model (7.1) and GUIDE is best in model (7.2). 
Because it makes fewer assumptions, GEE has a slightly higher MSE than 
LME in both models. 

8. Time-varying covariates and multiple series. Our approach re- 
quires all predictor variables to be fixed with respect to time. An example 
where there is a time- varying covariate is the Mothers' Stress and Children's 
Morbidity study reported in Alexander and Markowitz (1986) and analyzed 
in Diggle et al. (2002, Chap. 12). In this study, the daily presence or absence 
of maternal stress and child illness in 167 mother-child pairs was observed 
over a four-week period. The children ranged in age from 18 months to 
5 years. Time-independent variables, measured at the start of the study, are 
mother's marital and employment status (both binary), education level and 
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health (both ordinal with 5 categories), child's race and sex (both binary), 
child's health (ordinal with 5 categories), and household size (3 or fewer vs. 
more than 3 people). Diggle et al. (2002) use GEE logistic regression models 
to answer the following questions. 

1. Is there an association between mother's employment and child illness? 

2. Is there an association between mother's employment and stress? 

3. Does mother's stress cause child illness or vice versa? 

For predicting child illness, their GEE model shows that day (since enroll- 
ment), mother's marital status, child's health and race, and household size 
are statistically significant, but mother's employment is not. Our method 
gives a trivial tree with no splits after pruning, suggesting that no variable 
other than day has predictive power. For predicting mother's stress, their 
GEE model finds that day, mother's health, marital status and education, 
child's health, household size, and the interaction between day and employ- 
ment are significant. Our pruned tree has two terminal nodes, separating 
children that have very good health from those that do not. Figure 11 shows 
plots of the observed and lowess-smoothed mean frequencies of mother's 
stress, grouped by mother's employment status (left) and by child health as 
found by our tree model (right). The curves defined by employment cross 
over, lending support to the significance of the day-employment interaction 
effect found in the GEE model. The large separation between the two curves 
defined by child's health, on the other hand, indicates a large main effect. 

On the third question of whether mother's stress causes child's illness 
or vice versa, Diggle et al. (2002) find, by fitting GEE models with lagged 
values of stress and illness as additional predictors, that the answer can be 
both. They conclude that there is evidence of feedback, where a covariate 
both influences and is influenced by a response. Instead of trying to deter- 
mine which is the cause and which is the effect, we fit a regression tree model 
that simultaneously predicts mother's stress and child's illness by concate- 
nating the two series into one long series with 56 observations. Choosing 
d = 8 (four intervals each for stress and illness), we obtain the results in 
Figure 12, which shows that mother's health and household size are the 
most important predictors. The plots below the tree confirm that mother's 
stress (dashed curves) and child's illness (solid curves) vary together. More 
interesting is that the two responses do not decrease monotonically with 
time. In particular, when mother's health is fair or worse and household size 
is three or less, the frequencies of mother's stress and child's illness tend 
to decrease together in the first half and increase together in the second 
half of the study period. This behavior is ruled out by the GEE model of 
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Fig 11. Average and smoothed frequencies of mother's stress by employment and child 
health 



Diggle et al. (2002). We are thus reminded that the statistical significance 
of the terms in a parametric model always depends on the model being cor- 
rectly specified. If the specification is correct, the parametric approach will 
often possess greater sensitivity; otherwise important features of the data 
may be undetected. 

9. Asymptotic consistency. We give some conditions for asymptotic 
consistency of the regression function estimates, as the training sample size 
increases, for multiresponse and longitudinal data models. The conditions 
generalize those for univariate responses in Chaudhuri et al. (1994, 1995), 
Chaudhuri and Loh (2002), and Kim et al. (2007). We assume that there is 
a true regression function g(x,u), where x is a vector of predictor variable 
values in a compact set, u is the observation time in a compact set U, 
and sup ux |g(x, it)| < oo. The training data consist of vectors (yij, Xj, w^-), 
i = 1, . . . , M and j = 1, . . . ,mj, where yij is the observed response of subject 
i at time Uij £ U, Xj is the corresponding x value, and = g(xj, Uj.,) + €ij. 
The ejj's are assumed to have zero mean, constant (finite) variance, and to 
be independent of for all i and j. This setup applies to the multiresponse 
model as well, because it can be treated as a longitudinal model with fixed 
time points. Let N = J2i=i m % denote the total number of data points and 
let Tjv denote the collection of terminal nodes of a regression tree obtained 
by partitioning the data by its x values. Given (x*,ii*), let t* denote the 
terminal node containing x*. 
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Fig 12. Multivariate GUIDE model for simultaneously predicting maternal stress and child 
health. A case goes to the left branch at each intermediate node if and only if the condition 
on its left is satisfied. The number beneath each terminal node is the sample size. The plots 
below the tree show the observed and smoothed daily mean frequencies of mother's stress 
and child's illness. 
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9.1. Multiresponse and longitudinal data with fixed time points. Assume 
that U is a finite set. Let <5(Tjv) = mm te T N ,ueU \{{h j) '■ x i £ t,Uij = u}\ 
denote the smallest number of data points per time point across all terminal 
nodes. Define I-fa = ■ Xj G t*, Uij = u*} and let k^ denote the number 

of elements in Jjy. Assume further that the following conditions hold. 

Al. The €ij are mutually independent for all i and j. 
A2. 5(T N ) 4 oo as N -> oo. 

A3. For each u £ U, sup teTjv sup Xl X2€t |<?(xi, u) — #(x2, u)\ — >■ as N — > oo. 

Condition A2 ensures that there are sufficient observations in each terminal 
node for consistent estimation. Condition A3 requires the function to be 
sufficiently smooth; it implies that for each u £ U, <?(x, u) is uniformly 
continuous w.r.t. x in each t € Tjy. In other words, A3 assumes that the 
partitioning algorithm is capable of choosing the right splits so that within 
each node, the mean response curves are close to each other. 
The regression estimate of g(x*,u*) is 

g(x*,u*) = kx 1 J2 Vio 
(ij)ei* N 

= k N l Yl {g(*i,u*) + eij} 
by definition of 1^. Therefore 

|5(x*,u*)-5(x*,u*)| < k^\ {9fa,u*)-g(x*,u*)} +kx 1 \ e v 



Condition A3 implies that the first term on the right side of the inequality 

P 

converges to zero in probability. Condition A2 implies that kjy — > oo, which 
together with the independence and constant variance assumptions on 
imply that the second term converges to zero as well. Therefore g(x*, -u*) — > 
g(x*,u*) as N ->■ oo at every (x*,u*). 



9.2. Longitudinal data with random time points. Suppose now that U is 
a compact interval and that the u^'s are random. Let K(u) > be a kernel 
function with bandwidth h^. The estimate of g(x, u) at (x*,-u*) is 

* «v = E x ,a> Y^iK{h- N l ( Uij - u*)} yij 
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Let tin denote the smallest number of data points in the terminal nodes of 
the tree. Assume that the following conditions hold. 

Bl. The Uij values are independent and identically distributed and their 
density function f(u) is positive everywhere and does not depend on 
the Xj values, for all i and j. 

B2. sup teTjv sup{|<7(xi, u) — g{x.2,u)\ : u € U, xi, X2 € t} — > as N — > oo. 

B3. The density function of Uij is positive everywhere in U and 

(i) . / \K{u)\ du < oo, 

(ii) . lim^i.^ uK(u) = 0, 

p p p 

(iii) . rijv — > oo, — > and un^in — > oo as iV — > oo. 

B4. The error vectors = (eji, e^mj' are independent between subjects. 
For each i, has a covariance matrix with elements a-ijk such that 
&ijk = 0-2 f° r j = k an d maxj m~ x J2j^k a ijk < ^4 for some positive 
constant A. 

Condition Bl ensures that the value of u* is not constrained by the value of 
x*. Condition B2 is a stronger version of A3 and condition B3 is a standard 
requirement for consistency of kernel estimates. Condition B4 ensures that 
the correlations between the random errors are small. 
Write 

g(x*,u*) - g(x*,u*) 

Ex^ ggi K{h- N l { Uij - u*)}{ yij - g(x*,u*)} 

Ex,gt* Ej^i Kjh^juij - u*)}{g(xi, Uij ) + €jj - g(x*, u*)} 

v x . r v- , a- {/,,'! »• ;} 

Ex tgf > ggi ^{^K - n*)}{g(x*,^) - ff(x*X)} 

E Xl «* Epi^l^K-u*)} 

Ex.et* E™=i K{h N \ Uij - u*)}{g(^,u t3 ) - gjx^ujj)} 
Ex.gt* Egi Kjh^jujj - u*)}e i:j 

E Xl «* E^i^nK-"*)} 

= J\ + J 2 + J3 (say). 
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Define the local polynomial estimator (which depends on Uij's but not on 
the values of xi, . . . , xjf) 

Ex,gt* E^i Kjhjfjujj - u*)}g(x*, Uij ) 



Then J x = #(x*,u*) - g{x*,u*) 4 by condition B3 (Hardle, 1990, p. 29) 
P 

and J2 — > by condition B2. 

Note that (iV/iiv)~ 1 Ex iet *E7=i^{^ 1 (^ " «*)} ^ f(u*)fK(z)dz, 
where f(u) is the density function of the Uij. Conditions Bl and B4 imply 

2 

e (Nh N y l E^^Vy-Ote 

x t Gi* j=l 

= (Nh N )- 2 a 2 m l E[K 2 {h~ N l {u ll - «*)}] 

+ (iV/tjv)- 2 E [^{^(tiii - u*)}] 2 ^ o-^-fc 

< c 2 h^ 2 E[K 2 {h~^ (un - u*)}] 

+ A(iV/ lAr )- 2 [ J B J FC{/ i - 1 (n il -^)}] 2 £ rm 

XiGt* 

= a 2 {Nh N )- l f(u*) J K 2 (z)dz + AN- l ^J K{z)dz^ + o(l) 
-»■ 0. 

It follows that J3 -A and hence g(x*,u*) 4 g(x.*,u*) as AT — >■ 00. 

10. Concluding remarks. Previous algorithms for fitting regression 
trees to multiresponse and longitudinal data typically follow the CART ap- 
proach, with various likelihood-based node impurity functions. Although 
straightforward, this strategy has two disadvantages: the algorithms inherit 
the variable selection biases of CART and are constrained by computational 
difficulties due to maximum likelihood and covariance estimation at every 
node of the tree. 

To avoid these problems, we have introduced an algorithm based on the 
univariate GUIDE method that does not have selection bias and does not 
require maximization of likelihoods nor estimation of covariance matrices. 
Unbiasedness is obtained by selecting the split variable with contingency 
table chi-squared tests, where the columns of each table are defined by the 
patterns of the data trajectories relative to the mean trajectory and the 
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rows are defined by the values of a predictor variable. The mean trajectory 
is obtained by applying a nonparametric smoother to the data in the node. 
For split set selection and for tree pruning, the node impurity is defined as 
the total, over the number of response variables, of the sum of (optionally 
normalized) squared errors for each response variable. Correlations among 
longitudinal response values are implicitly accounted for by the smoothing 
and the residual trajectory patterns. 

Because no assumptions are made about the structure of the model in 
each node, it is quite possible that our method is less powerful than other 
tree methods in situations where the assumptions required by the latter are 
satisfied. (These assumptions, such as autoregressive models, are hard to 
justify because they need to be satisfied within random partitions of the 
data.) What we lose in sensitivity, though, we expect to gain in robustness. 
Besides, the simplicity of our smoothing and means-based approach lends 
itself more easily to asymptotic analysis. Further, as is evident from the lon- 
gitudinal data examples, plots of the smoothed mean trajectories in the ter- 
minal nodes provide a visual summary of the data that is more realistic than 
the necessarily more stylized summaries of parametric or semi-parametric 
models. 

Our approach should not be regarded, however, as a substitute for para- 
metric and semi-parametric methods such as GEE and LME for longitudinal 
data. Because the latter methods assume a parametric model for the mean 
response function, they permit parametric statistical inference, such as sig- 
nificance tests and confidence intervals, to be performed. No such inference 
is possible for regression tree models, as there are no model parameters in 
the traditional sense. Regression tree models are simply approximations to 
the unknown response functions, whatever they may be, and are meant for 
descriptive and prediction purposes. Although GEE and LME models can be 
used for prediction too, their constructions are based on significance tests, 
unlike tree models which are focused on prediction error. In applications 
where the sample size and number of predictor variables are small and the 
model is correctly specified, GEE and LME will always be more powerful 
than tree methods, due to the extra information provided by the parametric 
model. But if the sample size or the number of predictor variables is large, 
it can be challenging to select the right parametric model. It is in such sit- 
uations that a regression tree model can be quite useful because it provides 
a relatively simple and interpretable description of the data. The fitted tree 
model can serve a variable selection purpose as well, by identifying a subset 
of predictor variables for subsequent parametric modeling, if desired. 

The proposed method is implemented in the GUIDE software which can 
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be obtained from www.stat.wisc.edu/~loh/guide.html. 
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